The assignment was composed of three main parts: read coverage, variant calling and analytical performance. For each part, I report the question that was asked, the strategy I used to answer it and the results that were generated.
In Next Generation Sequencing (NGS) studies, the term “coverage” describes the number of times that a nucleotide or base in the reference is covered by a high-quality aligned read. A reference can be a whole genome, a specific chromosome or targeted genomic regions and the number of reads that align to, or “cover” it is known as redundancy of coverage and also as the depth or the depth of coverage (Sims et al. 2014). Coverage has also been used to denote the breadth of coverage of a target genome, which is defined as the percentage of target bases that are sequenced a given number of times. For example, a genome sequencing study may sequence a genome to 30X average depth and achieve a 95% breadth of coverage of the reference genome at a minimum depth of ten reads.
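As a toy illustration of the two definitions, with a made-up vector of per-base depths:

```r
# Hypothetical per-base depths along a 10-bp reference (made-up numbers)
depth <- c(0, 3, 12, 15, 8, 0, 22, 30, 11, 9)

# Mean depth of coverage: average number of reads covering each base
mean_depth <- mean(depth)            # 11

# Breadth of coverage at a minimum depth of 10 reads:
# fraction of reference bases covered by at least 10 reads
breadth_10x <- mean(depth >= 10)     # 0.5, i.e. 50%
```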
In this assignment, the reference was a region of chromosome 19 of the human genome hg19 and the first question was:
What can you say about the quality of the read coverage in the sample?
First, to reduce biased coverage estimates, I identified and removed read pairs that likely originated as duplicates of the same original DNA fragment through artifactual processes during sample preparation (e.g., PCR during library construction). For the given sample, ~25% of the aligned reads were estimated to be duplicates and were removed using Picard MarkDuplicates (Broad Institute 2018). The removal of duplicate reads is expected to reduce non-uniformity of coverage and thus increase the power to detect variants (Kozarewa et al. 2009). The data should also be corrected for patterns of systematic errors in the base quality scores, the confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during variant discovery, so it is important to correct any systematic bias observed in the data. However, because part 3 of the assignment asks for a comparison between variant call sets (i.e., ground truth vs. variants called by the pipeline), I performed the base quality recalibration step at a later stage of the analysis. I integrated the statistics of analytical performance (sensitivity, precision and specificity) with information from the recalibration step in order to give an additional view on potential reasons for the differences between the two variant call sets.
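Conceptually, duplicate marking keeps one read pair per alignment signature; the following is a minimal sketch with made-up records (real tools such as Picard MarkDuplicates also use mate coordinates and base qualities to pick the best representative):

```r
# Made-up alignment records; real duplicate marking also considers
# mate position and base qualities
reads <- data.frame(
  id     = paste0("read", 1:6),
  chr    = "chr19",
  start  = c(60010, 60010, 60010, 61500, 61500, 70000),
  strand = c("+", "+", "+", "-", "-", "+")
)
# Mark every record after the first with the same signature as a duplicate
reads$duplicate <- duplicated(reads[, c("chr", "start", "strand")])
sum(reads$duplicate) / nrow(reads)   # duplicate fraction: 3/6 = 0.5
```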
After filtering out duplicate reads, I calculated both the redundancy and the breadth of coverage for high-quality bases (\(\geq\)Q20) from high-quality alignments (\(\geq\)MapQ30) within the region of chromosome 19 of the human genome hg19, using SAMtools (Li et al. 2009) and deepTools (Ramírez et al. 2016). The genomic interval of chr19 started at position 60,004 and ended at position 14,992,498, for a total length of 14,932,495 bases.
The breadth of coverage, i.e. the percentage of the chr19 genomic region that was covered by at least ten reads, was 15%. Thus, the selected interval of chr19 was overall poorly covered (low breadth of coverage) and, on average, there were 13 reads per base pair, meaning an average coverage of 13X (Fig. 1). For single-sample sequencing, which is the case in this assignment, an average coverage of 30-35X has been the standard minimum for generating reliable variant calls (Ajay et al. 2011; Sims et al. 2014). Given that the chr19 region was overall scarcely covered and that the average coverage was much lower than the current standard, the variant calling output for the given sample should be treated with caution. I expect the variant calling output to contain variants that are not present in the ground truth (false positives) as well as to lack variants that are present in the ground truth (false negatives). I expect, of course, to find true positives and true negatives using this pipeline, but it will be less sensitive, less precise and less specific at calling variants than the pipeline used for generating the ground truth.
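The three performance metrics used in part 3 can be computed from confusion-matrix counts; the counts below are invented purely for illustration:

```r
# Invented confusion-matrix counts for a variant-call comparison
tp <- 900; fp <- 100; fn <- 300; tn <- 9000

sensitivity <- tp / (tp + fn)   # fraction of true variants recovered: 0.75
precision   <- tp / (tp + fp)   # fraction of calls that are true: 0.9
specificity <- tn / (tn + fp)   # fraction of non-variant sites correctly left uncalled
```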
library(ggplot2)
library(plotly)
# Read coverage redundancy table
redu <- read.table("results/1_read_coverage/chr19_cov_redundancy.txt")
# Reorder
redu <- redu[order(redu$V2), ]
# Percentage of bases covered by at least 10 reads
(sum(redu[redu$V2>=10, 1])/length(60004:14992498))*100
## [1] 15.06177
# Fraction of bases covered
redu$prop <- redu$V1/sum(redu$V1)
# Select values of read coverage for the figure
brks <- c(0:10, 15, seq(20,100,length.out = 9), 125, seq(150, 350,by = 50))
red_bin <- redu[redu$V2 %in% brks, ]
# Plot fraction of bases against read coverage without the zero coverage
ggplot(data = red_bin[-1, ], aes(x = as.factor(V2), y = prop)) +
  geom_col() +
  theme_bw() +
  labs(x = "Coverage (# reads per bp)", y = "Fraction of bases") +
  theme(axis.title = element_text(size = 16), axis.text = element_text(size = 11)) +
  annotate("text", y = 0.038, x = 25, label = "Mean = 14X", hjust = 1, vjust = 1, size = 6)
# Read coverage per base data (output DeepTools plotCoverage)
deepT <- read.table("results/1_read_coverage/chr19_coverage.txt")
deepB <- read.table("results/1_read_coverage/chr19_coverage_target.txt")
# Prepare data for plotting breadth against depth of coverage
deepT <- merge(x = deepT, y = deepB, by = c("V1", "V2", "V3"))
colnames(deepT) <- c("chr", "start", "end", "xxx", "intarget", "outarget")
# Count number of bases with the same depth
deepC <- apply(X = deepT[, -1:-3], MARGIN = 2, FUN = function(y) {
  aggregate(x = deepT$chr, by = list(y), length)
})
library(purrr)
library(dplyr)
# Merge list of dataframes
ad <- deepC %>% reduce(left_join, by = "Group.1")
colnames(ad) <- c("depth", "xxx", "intarget", "outarget")
# Replace NAs with 0
ad$outarget <- ifelse(test = is.na(ad$outarget), yes = 0, no = ad$outarget)
# Calculate "reverse cumulative sum"
calc_rev <- function(dat, coln, dep_val) {
  orow <- dat[dat[, 1] %in% dep_val, ]
  for (i in 1:nrow(orow)) {
    if (as.numeric(orow[i, 1]) == 0) {
      orow[i, paste(coln, "perc", sep = "_")] <- (orow[i, coln] / sum(dat[, coln])) * 100
    } else {
      orow[i, paste(coln, "perc", sep = "_")] <- (sum(dat[dat[, 1] >= as.numeric(orow[i, 1]), coln]) / sum(dat[, coln])) * 100
    }
  }
  return(orow[, paste(coln, "perc", sep = "_")])
}
ad$xxx_perc <- calc_rev(dat = ad, coln = "xxx", dep_val = ad$depth)
ad$intarget_perc <- calc_rev(dat = ad, coln = "intarget", dep_val = ad$depth)
ad$outarget_perc <- calc_rev(dat = ad, coln = "outarget", dep_val = ad$depth)
# Plot
fig <- plot_ly(ad[ad$depth<=200,], x = ~depth)
fig <- fig %>% add_trace(y = ~xxx_perc, name = 'sample XXX', type = 'scatter', mode = 'lines')
fig <- fig %>% add_trace(y = ~intarget_perc, name = 'intarget', type = 'scatter', mode = 'lines')
fig <- fig %>% add_trace(y = ~outarget_perc, name = 'outarget', type = 'scatter', mode = 'lines')
fig
Figure 1. Read coverage in the chr19 region. Left panel: fraction of bases at a selection of the observed read depths (not all values are shown), with the overall mean coverage annotated. Right panel: breadth of coverage, i.e. the fraction of the chr19 region covered by at least a given number of reads.
One important point must be made about the statistic that is often used for predicting the quality of the variant calling analysis. The mean of a continuous variable is most meaningful when the values are normally distributed and the variance is homogeneous. In NGS studies where coverage follows a normal distribution and is homoscedastic (e.g., Prabhu and Pe'er 2009; Yoon et al. 2009), the mean is a valid proxy for depth of coverage and thus for predicting the overall power of variant calling. The same criterion applies in studies where coverage is high and uniform across the reference sequence. However, coverage data is rarely normally distributed or homoscedastic, suggesting that the mean is not an appropriate proxy for depth of coverage and may therefore not be strongly predictive of the reliability of the variant calling output.
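To illustrate, simulating overdispersed coverage from a negative binomial distribution (a common model for NGS depth; the parameters below are arbitrary) shows how a long right tail pulls the mean away from the depth of a typical base:

```r
set.seed(1)
# Overdispersed depths: most bases shallow, a long right tail
depth <- rnbinom(n = 1e5, mu = 13, size = 0.5)

mean(depth)               # pulled up by the tail
median(depth)             # closer to the depth of a "typical" base
mean(depth >= 10) * 100   # breadth at 10X summarizes usable coverage directly
```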
The count of overlapping reads per position tells us how many bases are covered how many times. An uneven coverage profile can also arise when parts of the target region share high sequence similarity with other regions of the genome: reads that belong to the target are then "robbed" by similar sequences elsewhere, while relatively unique stretches retain their reads, which distorts the coverage profile. This is not necessarily a problem in itself, but it motivates applying a high mapping-quality threshold so that the retained reads have a high probability of mapping to the regions where they actually originated.
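In R terms, a mapping-quality filter amounts to discarding alignments below a threshold (the records below are made up; in the actual pipeline this filtering is done with SAMtools):

```r
# Made-up alignments with mapping qualities
aln <- data.frame(read = paste0("r", 1:5),
                  mapq = c(0, 12, 37, 60, 29))
# Keep only confidently mapped reads (MapQ >= 30)
high_conf <- aln[aln$mapq >= 30, ]
nrow(high_conf)   # 2
```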
The transition/transversion ratio (Ti/Tv ratio) is the proportion of variants observed as transitions (between purines, or between pyrimidines) versus transversions (between purines and pyrimidines). The Ti/Tv ratio is particularly useful for assessing the quality of single nucleotide polymorphisms inferred from sequencing data, and a higher ratio generally indicates higher accuracy (Jiang et al. 2019).
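A minimal sketch of computing the Ti/Tv ratio from a set of SNVs (the ref/alt pairs below are made up):

```r
# Made-up ref/alt alleles for six SNVs
ref <- c("A", "C", "G", "T", "A", "C")
alt <- c("G", "T", "A", "C", "T", "G")

purine <- c("A", "G")
# A transition stays within purines or within pyrimidines,
# so ref and alt are on the same side of the purine/pyrimidine split
is_ti <- (ref %in% purine) == (alt %in% purine)

titv <- sum(is_ti) / sum(!is_ti)   # 4 transitions / 2 transversions = 2
```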
SNPs as presented in public databases like dbSNP or HapMap are germline variations for which, at most, population frequencies are known. In the literature it is usually assumed that a variant should be found in more than 1% of the population in order to be called a SNP. Such information is very useful for biomarker development, since it describes the prevalence of a variant in different populations. However, it is normally not possible to obtain additional information (such as sex, age or disease status) on the individuals carrying the SNP; only the population a person belongs to is given. Since it is not known whether the information comes from a tumor or a normal sample, a correlation between diseases and SNPs cannot be calculated.